The Lexicon and MT: a position paper

Author

  • Jeremy Clear
Abstract

The recent trend towards developing the lexical component of NLP systems has focussed attention on two potentially valuable sources of lexical data: printed dictionaries for humans and large text corpora. This presentation considers the types of information that might be required by MT researchers and the extent to which this information can be derived from these two sources. This raises a number of questions, among which are the following. What type of information should be recorded in the lexicon? Dictionaries are quite comprehensive in their coverage of lexical items but how reliable are they? How can the information from a dictionary be represented in a form which is appropriate for NLP systems? Text corpora can provide a statistical basis for probabilistic models of language: what are the requirements with respect to size and composition of text corpora for deriving lexical data? Can the manual effort which is currently directed towards the compilation of printed dictionaries provide spin-off benefits for those who need lexical databases for MT?


Similar resources

Lexicon Exchange in MT - The Long Way to Standardization

LDV FORUM – Band 21(1) – 2006. Abstract: This paper discusses the question to what extent lexicon exchange in MT has been standardized during the last years. The introductory section is followed by a brief description of OLIF2, a format specifically designed for the exchange of terminological and lexicographical data (Section 2). Section 3 contains an overview of the import/export functionali...

Low-Density Language Bootstrapping: the Case of Tajiki Persian

Low-density languages raise difficulties for standard approaches to natural language processing that depend on large online corpora. Using Persian as a case study, we propose a novel method for bootstrapping MT capability for a low-density language in the case where it relates to a higher density variant. Tajiki Persian is a low-density language that uses the Cyrillic alphabet, while Iranian Pe...

Incorporation of a Valency Lexicon into a TectoMT Pipeline

In this paper, we focus on the incorporation of a valency lexicon into the TectoMT system for the Czech-Russian language pair. We demonstrate valency errors in MT output and describe how the introduction of a lexicon influenced the translation results. Though there was no impact on the BLEU score, manual inspection of concrete cases showed some improvement.

Universal Grammar and Lexis for Quick Ramp-Up of MT

This paper introduces Boas, a semi-automatic knowledge elicitation system that guides a team of two people through the process of developing the static knowledge sources for a moderate-quality, broad-coverage MT system from any "low-density" language into English in about six months. The paper focuses on some issues in the elicitation of descriptive knowledge in Boas and also the issue of the p...

Multilingual lexicons for related languages

The great increase in work on the lexicon by computational and theoretical linguists throughout the s has concerned itself almost exclusively with monolingual lexicons. Meanwhile, applied work on multilingual lexicons, mostly for machine translation (MT), has employed monolingual lexicons linked only at the level of semantics. In this paper we argue that the traditional MT lexicon architecture, while a...


Publication date: 2011